Abstract

Understanding the social dynamics that govern human phenomena, such as communications and social relationships, is a major problem in current computational social sciences.

In particular, given the unprecedented success of online social networks (OSNs), in this paper we are concerned with the analysis of aggregation patterns and social dynamics occurring among users of the largest OSN to date: Facebook.

In detail, we discuss the mesoscopic features of the community structure of this network, considering the perspective of the communities, which has not yet been studied on such a large scale. First, we acquired a sample of this network containing millions of users and their social relationships; then, we unveiled the communities representing the aggregation units among which users gather and interact; finally, we analyzed the statistical features of such a network of communities, discovering and characterizing some specific organization patterns followed by individuals interacting in online social networks, which clearly emerge even when considering different sampling techniques and clustering methodologies.

The implications of this study reflect the ability of individuals to exploit social interactions in such a way as to create a well-connected online social structure, and open space for further social studies.



Introduction 



Social media and online social networks (OSNs) represent a revolution in Web users' behavior that has been spreading at an unprecedented rate in recent years. Online users aggregate on platforms such as Facebook and Twitter, creating huge social networks of millions of persons who interact and group with each other. People create social ties, constituting groups based on existing real-life relationships, such as those with relatives, friends and colleagues, or based on common interests, shared tastes, etc.

In the context of computational social sciences, the analysis of social dynamics, including the description of those unique features that characterize online social networks, is acquiring increasing importance in current literature [1-5].



One of the big challenges for network scientists is to provide techniques to collect [6] and process [7] data in an automatic fashion, and strategies to unveil those features that characterize these types of complex networks [8]. These methods should be capable of working in such large-scale scenarios [9].

Amongst all the relevant problems in this area, the analysis of the so-called community structure of online social networks has acquired considerable attention in recent years [2, 10-15].



From a sociological perspective, studying the community structure of a network helps in explaining the social dynamics of interaction among groups of individuals [5, 16], including classic hypotheses such as Milgram's small world theory [17], Granovetter's strength of weak ties theory [18], Borgatti and Everett's core-periphery structure theory [19, 20], and so on.

On the other hand, discovering and analyzing the community structure is a topic of great interest for its economic and marketing implications [21]. For example, it could be possible to improve advertising performance by ensuring that the most influential users of each community are targeted, exploiting effects such as word-of-mouth and the spread of information within the community [22]. Similarly, it could be possible to exploit the affiliations of users to communities to provide them with useful recommendations on the basis of common interests shared with friends [23].

Finally, the community detection problem has plenty of challenges from a computational perspective, since it is highly related to the problem of clustering large (possibly heterogeneous [24]) datasets [25-29].



In this work we are concerned with the analysis of the community structure of the largest online social network to date: Facebook. In particular, first we acquire a sample from the Facebook social graph (i.e., the network of relationships among the users), then we apply two different state-of-the-art algorithms to unveil the underlying community structure (see the Appendix for technical details).

The further analysis of the mesoscopic features of this network highlights the social dynamics and the organization patterns that describe the behavior of online social network users on a large scale.

In detail, our study shows a number of surprising results; among them we discuss, for example:

(i) Social network users tend to form communities whose size follows a power law distribution, which means that there exist many groups of small size and a decreasing number of groups of larger size.

(ii) The number of interconnections that exist among communities also follows a typical power law behavior, which provides some clues towards the assessment of the strength of weak ties theory foreseen by the early work of Granovetter [18].

(iii) The community structure of the network is clearly defined, regardless of the methods adopted to unveil it, which supports the significance of our results and provides some guarantees with respect to well-known problems such as the bias introduced by sampling procedures [30] and the resolution limit suffered by some types of community detection algorithms [31, 32].

(iv) Communities present a high degree of clustering, which is an indicator of the presence of the so-called small world phenomenon, whose existence in real-world social networks was assessed during the sixties by Milgram [17].

(v) The effective diameter of the community structure of Facebook is small (i.e., around 4 to 5), in agreement with the six degrees of separation theory and the empirical evaluation recently carried out by Facebook using heuristic techniques [33, 34].



Methods 

The aim of this work is to analyze the mesoscopic features of the community structure of the Facebook social network. In the following we provide some information about the process of data collection, briefly discussing the sampling methodology and the techniques adopted to collect data.

This is the first step in studying the community structure of real-world networks, which reflect unique characteristics that are impossible to replicate by using synthetic network models [35].

After that, we discuss the process of community detection by which we unveiled the community structure of the network (additional technical details are discussed in the Appendix).

Finally, we describe how the community structure network is defined, followed by its analysis and a discussion of findings.



Sampling the Facebook network 

Similarly to other online social network platforms, Facebook does not release publicly accessible information regarding the overall social network structure, in order to protect the privacy of its users. We addressed this lack of data availability by acquiring public information directly from the platform, by means of a sampling process.

During this study we did not inspect, acquire or store personal information about users, since we were interested only in reconstructing the social connections among a sample of them. To this purpose, we designed a data mining platform whose only capability is visiting the publicly accessible friend-list Web pages of specific users, selected according to a sampling algorithm. The obtained data have been used only to reconstruct the large-scale community structure sample studied in this work.

The architecture of the designed mining platform can be briefly outlined as follows. We devised a data mining agent, which implements two sampling methodologies (breadth-first search and uniform sampling). The agent queries the Facebook server(s) in order to request the friend-list Web pages of specific users. In detail, the agent visits those Web pages containing the friend-list of a given user, following the directives of the chosen sampling methodology, and extracts the friendship relationships reported in the publicly accessible user profile.

The sampling procedure runs until one or more termination criteria are met (e.g., a maximum running time, a minimum size of the sample, etc.), concluding the sampling process. Collected data are processed and stored on our server in anonymized format (footnote 1), post-processed, cleaned and filtered according to further requirements.



The sampling methodologies 

In the following we briefly discuss the two statistical sampling methods adopted in this work, namely the breadth-first-search and the uniform sampling.



The breadth-first-search sampling 

The first sampling methodology implemented is the breadth-first-search (BFS), an uninformed graph traversal algorithm. Starting from a seed node, it explores its neighborhood; then, for each neighbor, it visits its unexplored neighbors, and so on, until the whole network is visited (or, alternatively, a termination criterion is met). This sampling technique has several advantages with respect to other techniques (for example, random walk sampling, forest fire sampling, etc.), as discussed in recent literature [36, 37]. One of the main advantages is that it produces a coherent graph whose topological features can be studied.

For this reason it has been adopted in a variety of OSN mining studies [1, 38-41]. During our experimentation, we defined as termination criterion that the mining process should not exceed 10 days of running time. By observing a short time limit, we ensured a negligible effect of the evolution of the network structure (less than 2% overall, according to the heuristic calculation provided in [39]). The size of the obtained (partial) graph of the Facebook social network has been adopted as the yardstick for the uniform sampling process.

1 Data are represented in a compact format in order to save I/O operations and are then anonymized, so as not to store any kind of private data (such as the user-IDs).
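The traversal described above can be sketched in a few lines. This is only an illustrative sketch, not the mining agent used in this work: `get_friends` is a hypothetical callable standing in for the retrieval of a friend-list, and the termination criterion is expressed as a maximum number of expanded nodes rather than a running time.

```python
from collections import deque


def bfs_sample(get_friends, seed, max_nodes):
    """Breadth-first sample of an undirected graph.

    get_friends: hypothetical callable returning the neighbors of a node.
    max_nodes:   termination criterion (maximum number of expanded nodes).
    Returns the set of discovered nodes and the set of undirected edges.
    """
    visited = {seed}
    queue = deque([seed])
    edges = set()
    expanded = 0
    while queue and expanded < max_nodes:
        node = queue.popleft()
        expanded += 1
        for friend in get_friends(node):
            # frozenset makes (a, b) and (b, a) the same undirected edge
            edges.add(frozenset((node, friend)))
            if friend not in visited:
                visited.add(friend)
                queue.append(friend)
    return visited, edges


# Toy friendship graph used only for illustration
toy_graph = {1: [2, 3], 2: [1, 4], 3: [1], 4: [2, 5], 5: [4]}
nodes, links = bfs_sample(toy_graph.__getitem__, seed=1, max_nodes=2)
```

With an early stop, the sample contains nodes that were discovered but never expanded, which is one way an incomplete visit can distort the sampled degrees.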



The uniform sampling 

The second chosen sampling methodology is a rejection-based sampling technique, namely the uniform sampling. The main advantage of this technique is that it is unbiased by construction, at least in its formulation for Facebook. Details about its definition are provided by Gjoka et al. [39]. The process consists of generating an arbitrary number of user-IDs, randomly distributed in the domain of assignment of the Facebook user-ID system. In our case, it is the space of the 32-bit numbers; thus, the maximum number of assignable user-IDs is $2^{32}$, about 4 billion. As of August 2010 (the period during which we carried out the sampling process), the number of subscribed users on Facebook was about 500 million; thus the probability of randomly generating an existing user-ID was $\approx 12.5\%$ (i.e., 1/8).

The sampling has been set up as follows: first we generated a number of random user-IDs, lying in the interval $[0, 2^{32} - 1]$, equal to the dimension of the BFS sample multiplied by 8. Then, we queried Facebook for their existence. Our expectation was to obtain a sample of dimensions comparable with the BFS sample. Actually, we obtained a slightly smaller sample, due to the restrictive privacy settings imposed by some users, who configured their profiles so as to prevent public access to their friend-lists. The issue of privacy has been investigated in our previous work [40].
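The rejection-based procedure can be illustrated with the following sketch. It is not the platform used in this work: `exists` is a hypothetical predicate standing in for the query that checks whether a user-ID is assigned and publicly accessible, while the 32-bit ID space, the 500 million users and the factor 8 are the figures quoted in the text. Note that 500 million over $2^{32}$ is closer to 11.6%, which the text rounds to 1/8.

```python
import random

ID_SPACE = 2 ** 32              # 32-bit user-ID space, as stated in the text
REGISTERED_USERS = 500_000_000  # approximate figure quoted for August 2010


def hit_probability(users=REGISTERED_USERS, space=ID_SPACE):
    """Probability that a uniformly drawn ID belongs to an existing user."""
    return users / space


def uniform_sample(exists, target_size, oversampling=8, rng=None):
    """Rejection sampling: draw random IDs and keep only the valid ones.

    exists:       hypothetical predicate telling whether an ID is valid.
    target_size:  desired sample size (e.g., the size of the BFS sample).
    oversampling: number of IDs drawn per desired user (8 in the text).
    """
    rng = rng or random.Random()
    candidates = {rng.randrange(ID_SPACE) for _ in range(target_size * oversampling)}
    return {uid for uid in candidates if exists(uid)}


# Illustration only: pretend that one ID out of eight is assigned
sample = uniform_sample(lambda uid: uid % 8 == 0, target_size=1000,
                        rng=random.Random(0))
```

Because every ID is drawn independently of the graph topology, accepted users are not favored according to their degree, which is the sense in which the method is unbiased by construction.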



Description of the acquired samples 



All the user-IDs contained in the samples have been anonymized using a 48-bit hashing function [42] in order to hide references to users and their connections. Data have then undergone a cleansing step, during which all the duplicates have been removed, and the integrity and congruency of data have been verified. The characteristics of the samples are reported in Table 1. The size of both samples is in the order of millions of nodes and edges.

The anonymized datasets acquired and studied in this work have been publicly released (footnote 2).



Some of the statistical and topological features of these networks have been discussed in previous work [40], and our main previous findings can be summarized as follows:

• From the networks it emerges that the degree distribution is described by a power law $P(x) = x^{-\lambda}$. In detail, it is possible to define two regimes, dividing the domain into two intervals (tentatively $1 < x < 10$ and $x > 10$), whose exponents are $\lambda_1^{BFS} = 2.45$, $\lambda_2^{BFS} = 0.6$ and $\lambda_1^{UNI} = 2.91$, $\lambda_2^{UNI} = 0.2$, respectively for the BFS and the uniform sample, in agreement with recent studies by Facebook [33, 34].



• Regarding the diameter of the networks, the BFS sample shows a diameter in agreement with the six degrees of separation, thanks to the "wavefront expansion" behavior of the sampling algorithm, which produces a plausible graph; by contrast, the uniform sample over-estimates the diameter, possibly because the largest connected component does not cover the whole graph.

• Regarding the clustering coefficient, we observed that the average values for both samples fluctuate in an interval similar to that reported by recent studies on OSNs [4, 39], confirming the presence of the well-known small world phenomenon.



2 http://www.emilio.ferrara.name/datasets/

Detecting communities

Given the large scale of our Facebook samples, containing millions of nodes and edges, most community detection algorithms could not deal with them. In order to unveil the community structure of these networks we adopted two computationally efficient algorithms: (i) Label Propagation Algorithm (LPA) [43], and (ii) Fast Network Community Algorithm (FNCA) [44].

In the following we present the main advantages of this choice and the performance of the two algorithms.



Advantages and performance of chosen methods 

The problem of choosing a particular community detection algorithm is crucial if the aim is to unveil the community structure of a network. In fact, the choice of a given methodology could affect the outcome of the experiments. In particular, there are several algorithms which depend on tuning specific parameters, such as the size of the communities in the given networks, and/or their number (for additional information see recent surveys [25-29]).



In this study, the purpose was to discover the unknown community structure of Facebook, and to do so we chose two different techniques which rely only on the topology of the network to unveil its community structure.

LPA (Label Propagation Algorithm) is an algorithm for community detection with a near-linear cost, based on the paradigm of label propagation. Its computational efficiency makes it well suited for the discovery of communities in large-scale networks, such as in our case. LPA only exploits the network structure as a guide and does not maximize any pre-defined objective function (differently from FNCA); in addition, it does not require any prior information about the communities, their number or their size.
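To make the label propagation paradigm concrete, the following is a minimal sketch of an asynchronous variant, not the implementation used in this work (whose details are in the Appendix). The adjacency dictionary `adj`, the rule of keeping the current label when it is already among the most frequent ones, and the `max_iter` cap are assumptions made for illustration.

```python
import random
from collections import Counter


def label_propagation(adj, max_iter=100, seed=0):
    """Asynchronous label propagation on an undirected graph.

    adj: dict mapping each node to a list of its neighbors (assumed input).
    Returns a dict mapping each node to its community label.
    """
    rng = random.Random(seed)
    labels = {node: node for node in adj}  # every node starts in its own community
    nodes = list(adj)
    for _ in range(max_iter):
        rng.shuffle(nodes)                 # visit nodes in random order
        changed = False
        for node in nodes:
            if not adj[node]:
                continue                   # isolated nodes keep their label
            counts = Counter(labels[nb] for nb in adj[node])
            best = max(counts.values())
            candidates = [lab for lab, c in counts.items() if c == best]
            # keep the current label if it is already among the most frequent
            if labels[node] not in candidates:
                labels[node] = rng.choice(candidates)
                changed = True
        if not changed:                    # every label is locally dominant
            break
    return labels


# Two disconnected triangles: each one collapses to a single label
toy = {0: [1, 2], 1: [0, 2], 2: [0, 1], 3: [4, 5], 4: [3, 5], 5: [3, 4]}
communities = label_propagation(toy)
```

Each sweep touches every edge once, which is where the near-linear cost mentioned above comes from; no number or size of communities is supplied.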

FNCA (Fast Network Community Algorithm) is a computationally efficient method to unveil the community structure of large-scale networks. It is based on the maximization of an objective function called network modularity, and it does not require prior information on the structure of the network, the number of communities present in the network or their size.
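As an illustration of the objective function, the sketch below evaluates the network modularity of a given partition using the standard Newman-Girvan definition, $Q = \sum_c \left[ l_c / m - (d_c / 2m)^2 \right]$, where $m$ is the number of edges, $l_c$ the number of edges inside community $c$ and $d_c$ the total degree of its nodes. It only scores a partition; it is not the FNCA optimization procedure, and the `edges` and `membership` inputs are assumed data structures.

```python
from collections import defaultdict


def modularity(edges, membership):
    """Newman-Girvan modularity of a partition of an undirected graph.

    edges:      list of (u, v) pairs, each undirected edge listed once.
    membership: dict mapping each node to its community label.
    """
    m = len(edges)
    internal = defaultdict(int)  # edges with both endpoints in the community
    degree = defaultdict(int)    # sum of node degrees per community
    for u, v in edges:
        cu, cv = membership[u], membership[v]
        degree[cu] += 1
        degree[cv] += 1
        if cu == cv:
            internal[cu] += 1
    return sum(internal[c] / m - (degree[c] / (2 * m)) ** 2 for c in degree)


# Two triangles joined by a single edge
toy_edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
two_groups = {0: "a", 1: "a", 2: "a", 3: "b", 4: "b", 5: "b"}
q = modularity(toy_edges, two_groups)
```

Placing all nodes in a single community gives a modularity of zero, so positive values indicate more intra-community edges than expected at random.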

Even though the paradigms on which the algorithms rely are different, a common feature emerges: 
their functioning is agnostic with respect to the considered network. This aspect makes them a feasible 
choice considering that we do not have any prior information about the characteristics of the community 
structure of Facebook. Further technical details regarding these methods are discussed in the Appendix 
of this paper. 

In order to assess the significance of the community structure obtained by using these algorithms, minimizing the risk of introducing bias due to the community detection process [32], we will discuss the similarity of the outcomes provided by the algorithms.

The performance of the LPA and FNCA on our Facebook samples is shown in Table 2. Both algorithms successfully unveil the community structure. High values of network modularity have been obtained in all the samples, which suggests the presence of a community structure.

The community structure has been represented by using a list of vectors which are identified by a "community-ID"; each vector contains the list of user-IDs (in anonymized format) of the users belonging to the given community; an example is depicted in Table 3.

Building the community meta-network

To study the mesoscopic features of the community structure of Facebook, we abstracted a meta-network consisting of the communities, as follows. We built a weighted undirected graph $G' = (V', E', \omega)$, whose set of nodes is represented by the communities constituting the given community structure. In $G'$ there exists an edge $e'_{uv} \in E'$ connecting a pair of nodes $u, v \in V'$ if and only if there exists in the social network graph $G = (V, E)$ at least one edge $e_{ij} \in E$ which connects a pair of nodes $i, j \in V$, such that $i \in u$ and $j \in v$ (i.e., user $i$ belongs to community $u$ and user $j$ belongs to community $v$). The weight function is simply defined as

$$\omega_{u,v} = \sum_{i \in u,\, j \in v} e_{ij} \qquad (1)$$

(i.e., the total number of edges connecting users belonging to $u$ with users belonging to $v$).
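The construction of the meta-network can be sketched as follows. The function and argument names are illustrative, and ignoring edges internal to a community (which would otherwise become self-loops) is an assumption, since the definition above only concerns pairs of communities.

```python
from collections import Counter


def build_meta_network(edges, membership):
    """Collapse a social graph into a weighted graph of communities.

    edges:      iterable of (i, j) user pairs (undirected friendships).
    membership: dict mapping each user to its community-ID.
    Returns a Counter mapping frozenset({u, v}) to the number of edges
    connecting users of community u with users of community v.
    """
    weights = Counter()
    for i, j in edges:
        u, v = membership[i], membership[j]
        if u != v:  # assumption: intra-community edges are not meta-edges
            weights[frozenset((u, v))] += 1
    return weights


toy_edges = [(0, 1), (1, 2), (2, 3), (0, 3), (3, 4)]
toy_membership = {0: "a", 1: "a", 2: "b", 3: "b", 4: "c"}
meta = build_meta_network(toy_edges, toy_membership)
```

In this toy case communities "a" and "b" are joined by a meta-edge of weight 2, and "b" and "c" by one of weight 1.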

Table 6 summarizes some characteristics of the networks obtained for the uniform sample by using FNCA and LPA. What immediately emerges is that the overall statistics obtained by using the two different community detection methods are very similar. The number of nodes in the meta-networks is smaller than the total number of communities discovered by the algorithms, because we excluded all those "communities" containing only one member (whose consideration would contradict the common-sense definition of community).

We discuss results regarding the community structure and its mesoscopic features in the following. 



Results 

The analysis of the community structure of Facebook will focus on the following aspects: (i) first, we evaluate the quality of the communities identified by means of the community detection algorithms described above. This step includes assessing the similarity of results obtained by using different sampling techniques and clustering methods. In detail, we evaluate the possible bias introduced by well-known limitations of these techniques (e.g., the resolution limit for modularity maximization methods [31, 32] and the sampling bias due to the incompleteness of the sampling process [30]). (ii) Second, we investigate the mesoscopic features of the community structure meta-network considering some characteristics of the network (such as the diameter, the distribution of shortest paths and weights of links, the connectivity among communities, etc.), discussing how these features may reflect on the social dynamics and organization patterns of individuals within the network.



Analysis of the community structure 

The first question that this analysis addresses is the distribution of the size of the communities discovered. This feature is an important indicator of the quality of the community structure discovered by using computational techniques. In fact, from the literature [13] it emerges that most online social networks exhibit a power law distribution in the size of the communities. This means that there should exist a large number of communities whose size is very small and a very small number of large communities.

Figures 1 and 2 represent the probability distribution of the presence of communities of given size, for the two considered samples (i.e., uniform and BFS), by using the two chosen community detection algorithms (i.e., LPA and FNCA). From the analysis of these figures, it first emerges that in both cases the results produced by the different community detection algorithms are very similar. Moreover, both distributions resemble a power law behavior (since, in the log-log plot, the curves are almost linear).

The analytical results reported in Table 2, combined with the plots, suggest that both algorithms identified a similar number of communities regardless of the adopted sampling method. This is also reflected by the very similar values of network modularity obtained for the two different sets. Moreover, the sizes of the communities themselves seem to coincide in most cases.

A detailed analysis of the previous considerations follows. First, we discuss the distribution of the size of 
the communities. Then we consider their structural similarity. 

Power law distribution of the community size 

Both the distributions obtained by using LPA and FNCA resemble a characteristic power law. To confirm this hypothesis, we computed the best fit to a power law distribution, obtaining that: (i) the distributions of communities obtained for the uniform sample are fitted by a power law function $P(k) \propto k^{-\gamma}$ with $\gamma = 1.07$, which effectively approximates their behavior ($p$-value $= 1.879 \cdot 10^{-2}$); (ii) the results produced for the BFS sample fit a power law function with $\gamma = 0.72$ ($p$-value $= 4.049 \cdot 10^{-2}$).
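The fitting procedure is not detailed here, so the following is only one simple way to estimate the exponent of $P(k) \propto k^{-\gamma}$: an ordinary least-squares line on the log-log points, whose slope is $-\gamma$. The function name and inputs are illustrative; maximum-likelihood estimators are generally considered more reliable than this regression for heavy-tailed data.

```python
import math


def fit_power_law_exponent(sizes, probs):
    """Estimate gamma in P(k) ~ k^(-gamma) by least squares in log-log space.

    sizes: community sizes k (positive numbers).
    probs: corresponding empirical probabilities P(k) (positive numbers).
    """
    xs = [math.log(k) for k in sizes]
    ys = [math.log(p) for p in probs]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # slope of the regression line through the log-log points
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    return -slope


# Synthetic data lying exactly on a power law with exponent 1.07
ks = [1, 2, 4, 8, 16, 32]
gamma = fit_power_law_exponent(ks, [k ** -1.07 for k in ks])
```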

In Figures 1 and 2 a logarithmic scale has been adopted in order to emphasize the power law behavior. In detail, by considering the distribution of community sizes within the uniform sample (Figure 1), a linear behavior emerges, which is described by a power law.

By contrast, the BFS sample (Figure 2) shows some fluctuations. The difference in behavior between the BFS and uniform samples reflects the adopted sampling techniques. In fact, it has recently been shown [30, 40] that a sampling algorithm such as the BFS may bias the sample towards high degree nodes in case of incomplete visits. Interestingly, this is confirmed by Figure 2, from which it emerges that in the BFS sample there exist communities, tentatively lying in the size interval $50 < x < 200$, that are more numerous than would be expected from a power law behavior.

To the best of our knowledge, we report the first large-scale case in which it emerges that the bias towards high degree nodes introduced by the BFS sampling method is reflected in the features of the communities identified by two different methods (relying on different paradigms). In this regard, rejection-based methods (such as the uniform sampling) appear more appropriate for the purpose of studying the community structure of networks on a large scale.

The second question we address is the quality of the community structure obtained by using FNCA 
and LPA. The idea that two different algorithms could produce different community structures is not 
counterintuitive, but in our case we have some clues that the obtained results could share a high degree 
of similarity. To this purpose, we investigate the similarity between the community structures obtained 
by using the two different algorithms. 

Community structure similarity 

In order to evaluate the similarity of two community structures we adopt a variant of the Jaccard coefficient, called binary Jaccard coefficient. It is defined as

$$J(v, w) = \frac{M_{11}}{M_{01} + M_{10} + M_{11}} \qquad (2)$$

where $M_{11}$ represents the total number of shared elements between vectors $v$ and $w$, $M_{01}$ represents the total number of elements belonging to $w$ and not belonging to $v$, and, finally, $M_{10}$ the vice versa. The result lies in $[0, 1]$.

The adoption of the binary Jaccard coefficient is due to the following consideration: if we computed the simple intersection of two sets (i.e., the community structures) by using the classic Jaccard coefficient, those communities differing even by only one member would be considered different, while a high degree of similarity among them could still be envisaged. We avoid this issue by adopting the binary Jaccard coefficient and comparing each vector of the former set against all the vectors in the latter set, in order to match the most similar one. The mean degree of similarity is computed as

$$\sum_{i=1}^{N} \frac{\max\left(J(v, w)_i\right)}{N} \qquad (3)$$

where $\max(J(v, w)_i)$ represents the highest value of similarity chosen among those calculated combining the vector $i$ of the former set with all the vectors of the latter set. We obtained the results reported in Table 4.
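The best-match comparison can be sketched as follows, assuming the coefficient is the usual ratio $M_{11} / (M_{01} + M_{10} + M_{11})$ built from the quantities defined above, and that the mean is taken over the communities of the former set. Function names are illustrative, and the quadratic all-pairs scan is kept for clarity rather than efficiency.

```python
def binary_jaccard(v, w):
    """Jaccard similarity between two communities given as collections of IDs."""
    v, w = set(v), set(w)
    m11 = len(v & w)   # members shared by v and w
    m10 = len(v - w)   # members of v missing from w
    m01 = len(w - v)   # members of w missing from v
    total = m11 + m10 + m01
    return m11 / total if total else 0.0


def mean_similarity(former, latter):
    """Average, over the communities of `former`, of the best match in `latter`."""
    best = [max(binary_jaccard(v, w) for w in latter) for v in former]
    return sum(best) / len(best)


lpa_like = [{1, 2, 3}, {4, 5}]
fnca_like = [{1, 2, 3, 4}, {5}]
score = mean_similarity(lpa_like, fnca_like)  # (3/4 + 1/2) / 2
```

In this toy case no community is identical across the two structures, yet the mean similarity is 0.625, which is the situation the best-match comparison is meant to capture.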

In addition, to establish the correlation between the distributions we adopt a divergence measure, called Kullback-Leibler divergence, defined as

$$D_{KL}(P \| Q) = \sum_{i} P(i) \log \frac{P(i)}{Q(i)} \qquad (4)$$

where $P$ and $Q$ represent, respectively, the probability distributions that characterize the behavior of the LPA and the FNCA community sizes, calculated on a given sample. Let $i$ be a given size, such that $P(i)$ and $Q(i)$ represent the probability that a community of size $i$ exists in the distributions $P$ and $Q$. The KL divergence is helpful to quantify how different two distributions are from one another. In particular, since the KL divergence is defined in the interval $0 \le D_{KL} \le \infty$, the smaller the value of KL divergence between two distributions, the more similar they are.
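A minimal sketch of this computation is given below. The dictionaries mapping a community size to its probability, the use of the natural logarithm and the convention of returning infinity when $Q(i) = 0$ while $P(i) > 0$ are assumptions, since none of them is specified in the text.

```python
import math


def kl_divergence(p, q):
    """Kullback-Leibler divergence D_KL(P || Q) for discrete distributions.

    p, q: dicts mapping a community size i to its probability.
    Terms with P(i) = 0 contribute nothing; if Q(i) = 0 where P(i) > 0
    the divergence is infinite.
    """
    total = 0.0
    for i, p_i in p.items():
        if p_i == 0:
            continue
        q_i = q.get(i, 0.0)
        if q_i == 0:
            return math.inf
        total += p_i * math.log(p_i / q_i)
    return total


p_sizes = {1: 0.5, 2: 0.5}
q_sizes = {1: 0.25, 2: 0.75}
d_pq = kl_divergence(p_sizes, q_sizes)
d_qp = kl_divergence(q_sizes, p_sizes)
```

The measure is not symmetric, which is why both directions are reported for each sample.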

We calculated the pairwise KL divergences between the distributions discussed above, finding the following results.

(i) on the uniform sample:

• $D_{KL}(P_{LPA} \| P_{FNCA}) = 7.722$

• $D_{KL}(P_{FNCA} \| P_{LPA}) = 7.542$

(ii) on the BFS sample:

• $D_{KL}(P_{LPA} \| P_{FNCA}) = 3.764 \cdot 10^{-3}$

• $D_{KL}(P_{FNCA} \| P_{LPA}) = 4.292 \cdot 10^{-3}$

The values found by adopting the KL divergence highlight a strong correlation between the distributions calculated by using the two different algorithms on the two different samples.

From the results it emerges that both these algorithms produce a statistically significant community structure for the Facebook network. In fact, while the number of identical communities between the two sets obtained by using, respectively, BFS and uniform sampling is not so high (i.e., respectively, ≈ 2% and ≈ 35%), the overall mean degree of similarity is very high (i.e., ≈ 73% and ≈ 91%). This is due to the high number of communities which differ only by a very small number of components. Finally, the fact that the median is, respectively, ≈ 75% and ≈ 99%, and that the vast majority of results lie within one standard deviation, demonstrates the strong similarity of the obtained community structures.

Figures 3 and 4 graphically highlight these findings. Their interpretation is as follows: the x-axis and the y-axis represent the communities discovered by the FNCA and the LPA methods, respectively. The higher the degree of similarity between two compared communities, the higher the heat-map score. The similarity is graphically evident, considering that the heat values shown in the figures are very high (i.e., greater than 0.7) for most of the heat-map.



Resolution limit and bias 



Recently [31], in the context of detecting communities by adopting the network modularity as the function to maximize, a resolution limit has been highlighted. In particular, the authors of [31] found that modularity optimization could, depending on the topology of the network, make the community detection process unable to find communities whose size is smaller than a given threshold (in our case ≈ 3,000). This is reflected in another effect, namely the creation of big communities that include a large part of the nodes of the network, without affecting the global value of network modularity.

Since most of the communities revealed are smaller than that size and well distributed according to a power law (footnote 3), we may hypothesize that the community structure unveiled by the algorithm for our samples is unlikely to be affected by the resolution limit.

Nevertheless, assuming that the characteristics of our networks may be affected by the adopted sampling method, we investigate the effect of the resolution limit on the community structure. In particular, we analyze the presence of communities that suspiciously exceed the size that would be expected according to the power law distributions discussed above, henceforth called outlier communities.

The results of our analysis on the BFS and the uniform samples are discussed separately. On the former sample, a small number of outlier communities has been identified, in particular for the FNCA method, possibly because of the resolution limit effect suffered by FNCA, which is a community detection algorithm based on the paradigm of network modularity maximization.

Table 5 reports the number of outlier communities suspected of suffering from bias. Regarding the BFS sample, from the analysis it emerges that LPA produces fewer outlier communities than FNCA.

Different considerations hold for the uniform sample, which apparently does not suffer from bias or from the resolution limit problem. Regarding the FNCA, a large number of communities whose dimension is slightly greater than one thousand members represents those communities coincident with the tail of the power law distribution depicted in Figure 1, and cannot be considered outlier communities. Similarly, the LPA method on the uniform sample provides the most reliable results, without incurring any bias reflected by outlier communities or any effect of the resolution limit.



Discussion 

In the following discussion of results we consider the uniform sample and the community structure unveiled by the LPA as the yardstick for our investigation. Our discussion focuses in particular on three aspects: (i) assessment of the mesoscopic features of the community structure of the network and their implications in terms of social dynamics; (ii) study of the connectivity among communities and how it reflects on users' organization patterns on a large scale; (iii) the ability to infer additional insights by means of visual observation of the community structure.

This work then concludes by highlighting the implications, strengths and limitations of our discussion.

3 We recall that the presence of a power law distribution with a clear amount of small communities is important also for the evaluation of the resolution limit [31].

Mesoscopic features of the community structure

The purpose of this section is investigating and discussing the mesoscopic features of the community 
structure of Facebook. This aspect includes finding patterns that emerge from the network structure, 
and in particular those which are related not to individuals or to the overall networks, but concerned 
with those aggregation units that are the communities among which users gather. 

To this purpose, we first discuss the degree distribution of communities discovered by means of our methods (i.e., FNCA and LPA) in the uniform sample. Figure 5 shows the degree probability distribution as a function of the degree in the cases discussed above. Analyzing the degree distribution of the community structure meta-network we find a very peculiar feature. In detail, these distributions are identified by two different regimes, tentatively $1 < x < 10^2$ and $x > 10^2$. Both regimes fit well to a power law, defined as $P(x) \propto x^{-\gamma}$ with $\gamma = 0.56$ for the former regime and $\gamma = 3.51$ for the latter regime. Interestingly, such a particular behavior has been previously found in the Facebook social graph [39].

The presence of a scaling law in the degree distribution has been put in correlation with the so-called
self-organization of human networks [45]. Self-organization is the ability of individuals to coordinate and
organize in patterns or structures which are proven to be efficient, robust and reliable. For example,
efficiency could be expressed in terms of minimizing the cost of diffusing information [46, 47], robustness
could be represented by the presence of redundant connections that link the same groups, and reliability
by the ability of the network to react well to errors and malfunctioning [48, 50].



Interestingly, self-organization is a phenomenon which is known to happen in small world networks
[46, 47, 51, 52] and in their community structure [53]. In the light of this assumption, we investigated the
presence of the small world effect in the community structure of Facebook. A reliable
indicator of the presence of this phenomenon is the clustering coefficient, i.e., the tendency to
create closed triangles among triads of communities. In our context, the clustering coefficient of a
community is the ratio of the number of existing links among its neighbors over the number of possible
links among them. Given our meta-network $G = (V, E)$, the clustering coefficient $C_i$ of
community $i \in V$ is

$$C_i = \frac{2 \, |\{(v,w) \mid (i,v), (i,w), (v,w) \in E\}|}{k_i (k_i - 1)}$$

where $k_i$ is the degree of community i.

It can be intuitively interpreted as the probability that, given two randomly chosen communities that 
share a common neighbor, there also exists a link between them. High values of average clustering 
coefficient indicate that the communities are well connected among each other. This result would be 
interesting since it would indicate a tendency to the small world effect. 
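
As an illustration, the clustering coefficient defined above can be computed as in the following minimal
Python sketch. It is not necessarily the implementation used for the experiments: the adjacency-dictionary
representation, the function name clustering_coefficient, the convention of returning 0 for communities
with fewer than two neighbors, and the toy meta-network are all assumptions made for the example.

```python
from itertools import combinations


def clustering_coefficient(adj, i):
    """Local clustering coefficient C_i of node i in an undirected graph.

    adj maps each node to the set of its neighbors (no self-loops).
    """
    neighbors = adj[i]
    k = len(neighbors)
    if k < 2:
        # C_i is undefined for k < 2; returning 0 is a convention (assumption).
        return 0.0
    # Count the links that actually exist among the neighbors of i.
    links = sum(1 for v, w in combinations(neighbors, 2) if w in adj[v])
    return 2.0 * links / (k * (k - 1))


# Hypothetical toy meta-network: A, B, C form a triangle, D is attached to A only.
toy = {
    "A": {"B", "C", "D"},
    "B": {"A", "C"},
    "C": {"A", "B"},
    "D": {"A"},
}

for community in sorted(toy):
    print(community, clustering_coefficient(toy, community))
```

In the toy example, community A has three neighbors and only one of the three possible links among them
exists, so its coefficient is 1/3, whereas B and C, whose two neighbors are linked, reach the maximum value 1.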

We plotted the average clustering coefficient probability distribution for the community structure in
Figure 6. From its analysis it emerges that the distribution is, surprisingly, described by a power law
$P(x) \propto x^{-\gamma}$ with exponent $\gamma = 0.48$. The slope of this curve is gentle, which implies
a high probability of finding communities with a large clustering coefficient, irrespective of
the number of connections they have with other communities.

This interesting fact reflects the existence of a very tight and proficiently connected core in the community
structure [19, 20], and the small world effect allows for efficient information spreading as a result of
the presence of short paths connecting communities. In fact, it is reasonable to suppose that, randomly
selecting two disconnected communities, it is very likely that a short path connecting their members
exists.

To investigate this aspect, in the following we analyze the effective diameter and the shortest paths
distribution in the community structure. Figure 7 reports the cumulative distribution
function of the probability that two arbitrary communities are connected by a given number of hops. The
cumulative distribution function (cdf), defined as $F(x) = \Pr(X \le x)$, gives the probability
that a random variable X assumes a value not exceeding a given x. In that sense, from Figure 7 it emerges
that all communities are connected within 6 hops (see footnote 4) and, most interestingly, that the highest
gain in the probability of connecting two randomly chosen communities is obtained at paths of
length 3.
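
The cumulative distribution of hop counts discussed above can be sketched as follows. This is only an
illustrative Python example under stated assumptions: it runs an exact breadth-first search from every
node, which is feasible on a meta-network of communities but is not claimed to be the procedure used to
produce Figure 7; the function names hop_counts and shortest_path_cdf and the toy path graph are hypothetical.

```python
from collections import Counter, deque


def hop_counts(adj, source):
    """Breadth-first search: number of hops from source to each reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def shortest_path_cdf(adj):
    """Return {h: fraction of connected pairs whose shortest path is <= h hops}."""
    histogram = Counter()
    for s in adj:
        for t, d in hop_counts(adj, s).items():
            if t != s:
                # Every unordered pair is visited twice; the ratio is unaffected.
                histogram[d] += 1
    total = sum(histogram.values())
    cdf, running = {}, 0
    for h in sorted(histogram):
        running += histogram[h]
        cdf[h] = running / total
    return cdf


# Hypothetical toy graph: a path 0 - 1 - 2 - 3.
path = {0: {1}, 1: {0, 2}, 2: {1, 3}, 3: {2}}
cdf = shortest_path_cdf(path)
```

The largest step of the resulting cdf identifies the path length that yields the highest gain in the
probability of connecting two randomly chosen nodes, which is the quantity commented on in the text.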

This aspect is further investigated as follows: Figure 8 represents the probability distribution of the
shortest paths against the path length. The interesting behavior which emerges from the analysis is that
the shortest path probability distribution reaches a peak for paths of length 2 and 3. In correspondence
with this peak, the number of connected pairs of communities quickly grows, reaching the effective
diameter of the networks (cf. Figure 7). This finding has an important impact on the features of the
overall social graph. In fact, if we suppose that all nodes belonging to a given community are
well connected to each other, or even directly connected, this would result in a very short diameter of the
social graph itself, since there would always exist a very short path connecting the communities of any
pair of randomly chosen members of the social network. Interestingly, this hypothesis is substantiated by
recent studies by Facebook, which used heuristic techniques to measure the average diameter of the whole
network [33, 34]. Their outcomes are very similar to our results: they estimated an average
diameter of 4.72, while the effective diameter of the community structure for our uniform sample is 4.45
and 4.85, respectively for LPA and FNCA.

We conclude the characterization of the mesoscopic features of the community structure by discussing
the distribution of weights and strength of links among communities. The importance of this kind of
analysis arises from social conjectures, like Granovetter's strength of weak ties theory [18],
that rely on the assessment of the strength of links in social networks. To this purpose, we recall
that the strength $s_\omega(v)$ (or weighted degree) of a given node v is determined as the sum of the
weights of all edges incident on v

$$s_\omega(v) = \sum_{e \in I(v)} \omega(e)$$

where $\omega(e)$ is the weight of a given edge e and $I(v)$ is the set of edges incident on v.
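
The definition of strength translates directly into code. The following minimal Python sketch assumes
that the meta-network is available as a list of (community, community, weight) triples; this edge-list
format, the function name node_strengths and the toy weights are assumptions made for illustration.

```python
from collections import defaultdict


def node_strengths(weighted_edges):
    """Sum the weights of all edges incident on each node of an undirected graph."""
    strength = defaultdict(float)
    for u, v, w in weighted_edges:
        # An undirected edge contributes its weight to both of its end-vertices.
        strength[u] += w
        strength[v] += w
    return dict(strength)


# Hypothetical meta-network: weights count the user-level links between two communities.
meta_edges = [("A", "B", 5), ("A", "C", 2), ("B", "C", 1)]
strengths = node_strengths(meta_edges)
```

Here community A has strength 5 + 2 = 7, B has 5 + 1 = 6 and C has 2 + 1 = 3.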

In Figure 9 we plotted the probability distribution of both weight and strength of links among
communities. In both cases, a power law behavior emerges once again. In particular, the distribution of weights
is described by a single-regime power law $P(x) \propto x^{-\gamma}$ with coefficient $\gamma = 1.45$. The
strength distribution is better described by a power law with two different regimes, in intervals similar to
those of the degree probability distribution (i.e., tentatively $1 < x < 10^2$ and $x > 10^2$), with
coefficients $\gamma = 1.50$ and $\gamma = 3.12$.
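
To make the notion of a two-regime power law concrete, the sketch below estimates a separate exponent
on each side of a crossover point by a least-squares fit in log-log space. The fitting procedure is an
assumption, since the text does not state how the exponents were obtained, and maximum-likelihood
estimators are generally considered more reliable on empirical data; the function names and the synthetic
data, generated from the exponents quoted above, are hypothetical.

```python
import math


def fit_power_law(xs, ys):
    """Least-squares fit of log(y) = c - gamma * log(x); returns gamma."""
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    n = len(lx)
    mean_x = sum(lx) / n
    mean_y = sum(ly) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(lx, ly))
    sxx = sum((a - mean_x) ** 2 for a in lx)
    return -sxy / sxx


def fit_two_regimes(xs, ys, crossover=100):
    """Fit one exponent for x < crossover and another for x >= crossover."""
    low = [(x, y) for x, y in zip(xs, ys) if x < crossover]
    high = [(x, y) for x, y in zip(xs, ys) if x >= crossover]
    return fit_power_law(*zip(*low)), fit_power_law(*zip(*high))


# Synthetic, noise-free data built from the exponents reported for the strength distribution.
xs = [1, 2, 5, 10, 20, 50, 100, 200, 500, 1000]
ys = [x ** -1.50 if x < 100 else 100 ** 1.62 * x ** -3.12 for x in xs]
gamma_low, gamma_high = fit_two_regimes(xs, ys)
```

On this noise-free input the two fitted exponents coincide, up to floating-point error, with the values
1.50 and 3.12 used to generate the data.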

Given the definition of weights for the meta-network, as in Equation 1 (i.e., the total number of
edges connecting all users belonging to the two connected communities), we can suggest the hypothesis
that there exists a high probability of finding a large number of pairs of communities whose members
are not directly connected, and an increasingly smaller number of pairs of communities whose members
are highly connected to each other. These connections, which are usually referred to as weak ties, according
to the strength of weak ties theory, are characterized by a smaller strength but a heightened tendency to
proficiently connect communities that would otherwise be disconnected. This aspect is further discussed in
the following.

4 To this regard, we highlight that the x-axis is reversed and we recall that the diameter of the considered
community structures is 4.45 and 4.85, respectively for LPA and FNCA.



Connectivity among communities

The last experiment discussed in this paper is devoted to understanding the density of links connecting
communities in Facebook. In particular, we are interested in determining to what extent links connect
communities of comparable or different size. To do so, we considered each edge in the meta-network and
computed the size of the community to which the source node of the edge belongs. Similarly, we
computed the size of the target community (see footnote 5).
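
The procedure just described can be sketched in Python as follows. This is an illustrative example
rather than the code used for the experiments: the logarithmic (base 10) binning of community sizes, the
function name size_pair_histogram and the toy data are assumptions. Each undirected edge is recorded
in both orientations, so the resulting map is symmetric.

```python
import math
from collections import Counter


def size_pair_histogram(meta_edges, community_size):
    """Count meta-edges by the (log10 size bin, log10 size bin) of their end-vertices.

    Each undirected edge is counted twice, once per orientation.
    """
    hist = Counter()
    for u, v in meta_edges:
        bin_u = int(math.log10(community_size[u]))
        bin_v = int(math.log10(community_size[v]))
        hist[(bin_u, bin_v)] += 1
        hist[(bin_v, bin_u)] += 1
    return hist


# Hypothetical community sizes (number of users) and meta-network edges.
sizes = {"A": 3, "B": 8, "C": 40, "D": 1500}
edges = [("A", "B"), ("A", "C"), ("B", "D")]
hist = size_pair_histogram(edges, sizes)
```

Dividing each count by the total number of recorded pairs would give an empirical probability map
of links by the sizes of the communities they connect.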



Figure 10 represents a probability density map of the distribution of edges among communities. First, we
highlight that the map is symmetric with respect to the diagonal, since the graph is
undirected and each edge is counted twice, once for each end-vertex. From the analysis of this figure, it
clearly emerges that edges mainly connect two types of communities: (i) small communities with each
other, which is the most common case; (ii) small communities with large communities, which is less
likely to happen but still statistically significant.

To a certain extent, this could be intuitive since the number of small communities, according to
their power law distribution, is much greater than the number of large communities. On the other hand,
it is an important assessment since similar results have been recently described for Twitter [5], in the
context of the evaluation of Granovetter's strength of weak ties theory [18] (see footnote 6). In fact,
according to this theory, weak links typically occur among communities that do not share a large
amount of neighbors, and are important to keep the network proficiently connected.



Inter and intra-community links 

For further analysis, we evaluate the amount of edges that fall in each given community with respect to
its size. The results of this assessment are reported in Figure 11. The plot is read as
follows: the y-axis represents the fraction of edges per community as a function of the size
of the community itself, reported on the x-axis. It emerges that the distribution of the link fraction
against the size of the communities also resembles a power law.
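
One possible reading of this measurement is sketched below in Python: for each community we count the
user-level edges with both end-vertices inside it and divide by the total number of edges. This
interpretation, the function name edge_fraction_by_community and the toy data are assumptions, and the
quantity plotted in Figure 11 may be defined differently.

```python
from collections import Counter


def edge_fraction_by_community(edges, membership):
    """Return {community: (size, fraction of all edges that are internal to it)}."""
    size = Counter(membership.values())
    internal = Counter()
    for u, v in edges:
        if membership[u] == membership[v]:
            internal[membership[u]] += 1
    total = len(edges)
    return {c: (size[c], internal[c] / total) for c in size}


# Hypothetical toy network: community A is a triangle, B is a single linked pair.
membership = {0: "A", 1: "A", 2: "A", 3: "B", 4: "B"}
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (2, 3)]
fractions = edge_fraction_by_community(edges, membership)
```

Plotting the second element of each pair against the first, on logarithmic axes, gives the kind of
size-versus-fraction view discussed in the text.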

Indeed, this result is different from that recently reported for Twitter [5], in which a Gaussian-like
distribution has been discovered. This is probably due to the intrinsic characteristics of the two networks,
which are topologically dissimilar (i.e., Twitter is represented by a directed graph with multiple types of
edges) and in which the interpretation of a social tie is itself different. In fact, Twitter represents in a
way hierarchical connections, in the form of follower and followed users, while Facebook tries to reflect a
friendship social structure which better represents the community structure of real social networks.

The emergence of this scaling law is important with regard to the organization patterns that are reflected
by individuals participating in large-scale social networks. In fact, it seems that users that constitute
small communities are generally very well connected to other communities, while large communities of
individuals seem to be linked in a less efficient way to other communities. This is reflected by the small
number of weak ties incident on communities of large size with respect to the number of individuals they
gather. These findings are relevant since they testify that individuals, even on a large scale, are able to
achieve high levels of proficiency in self-organization, in order to maximize their ability to efficiently get
in touch and communicate with a large number of users.

5 We recall that, since the adopted network model is undirected, the meaning of source and target node is only
instrumental to identify the end-vertices of each given edge.

6 The role of weak ties is to connect small communities of acquaintances which are not close enough to belong to
the same community but, on the other hand, are somehow proficiently in contact.



Visual observation of the community meta-network 

The visual analysis of large-scale networks is usually unfeasible when managing samples whose size is
in the order of millions of entities. Nevertheless, by adopting our technique of building a community
structure meta-network, it is still possible to study the mesoscopic features of the Facebook social network
from an unprecedented perspective. For example, social network analysts may be able
to infer additional insights about the structure of the original network from the visual analysis of its
community structure.

In Figure 12, obtained by using Cvis (see footnote 7), a hierarchy-based circular visualization algorithm,
we represent the community structure unveiled by LPA in the uniform sample. From its analysis, it is possible
to appreciate the existence of a tight core of communities which occupy a central position in the
meta-network [19, 20]. A further inspection of the features of these communities revealed that their positioning
is generally irrespective of their size. This means that there are several small communities which
play a dominant role in the network, which is in agreement with previous findings and highlights the
role of self-organization even on such a large scale. Similar considerations hold for the periphery of the
network, which is constituted by both small and larger communities.

Finally, we highlight the presence of so-called weak ties, which proficiently connect communities that
would otherwise be far from each other. In particular, those that connect communities in the core with
communities in the periphery of the network, according to the strength of weak ties theory [18], represent
the most important paths along which communications flow, enhancing users' ability to get in
touch with each other, to spread information efficiently, and so on.



Implications 

A summary of the implications of the results achieved with our analysis of the Facebook community
structure follows.

First of all, in this paper we show that the community structure of the Facebook social
network presents a clear power law distribution of the size of the communities, similarly to other
large social networks [13]. This result is independent of the algorithm adopted to discover
the community structure, and even (but in a less evident way) of the sampling methodology adopted to
collect the samples. On the other hand, this is the first experimental work that supports, on a large scale,
the hypothesis, theoretically advanced by [30], of a possible bias towards high degree nodes introduced
by the BFS sampling methodology for incomplete sampling of large networks.

Regarding the qualitative analysis of our results, it emerges that the communities share a high degree of
similarity among different samples, which means that they emerge clearly in the topology of the network.

The analysis of the community structure meta-network highlights different mesoscopic features.
We discovered that the community structure is characterized by a power law probability distribution of
community degree, which highlights the tendency of users to self-organize into communities
that efficiently maximize their ability to get in touch in few steps.



7 https://sites.google.com/site/andrealancichinetti/cvis

Our further analysis highlights that there exists a tendency to the creation of short paths (whose length
is mainly two or three hops) that proficiently connect the majority of the communities existing
in the network. This implies the existence of some kinds of weak ties in Granovetter's sense [18], that
interconnect communities of users which would otherwise be far from each other.

The analysis we carried out highlights both the presence of these weak ties among communities
and their distribution with respect to the size of the communities themselves. What emerged is that
users mainly aggregate in communities of small size but, on the other hand, that these small
communities are very well interconnected with each other, allowing those individuals to get in touch very
proficiently with other users, even far from their network of close friends.

To the best of our knowledge, this is the first work that unveils this kind of social dynamics on a large
scale in the context of online social networks.



Results in context with previous literature 

Several recent studies focus on the analysis of the community structure of different social networks
[13, 14, 35, 54]. An in-depth analysis of the Facebook collegiate networks has been carried out in [14].
The authors considered data collected from 5 American colleges and examined how online social lives
reflect the real social structure. They showed that the analysis of the community structure of online social
networks is fundamental to obtain additional insights about the prominent motivations which underlie
community creation in the corresponding real world. Moreover, the authors found that the Facebook social
network shows a very tight community structure, providing high values of network modularity. Some of
their findings are confirmed in this study on a large scale.



Recently [13], it has been shown that the community structure of social networks shares
similarities with communication and biological networks. The authors investigated several mesoscopic features
of different networks, such as community size distribution, density of communities and the average shortest
path length, finding that these features are very characteristic of the network nature. In line with their
findings, we assessed that Facebook is also well described by some specific characteristics on a mesoscopic
level.



Regarding the mesoscale structure analysis of social networks, [35] provided a study comparing three
state-of-the-art methods to detect the community structure of large-scale networks. An interesting aspect
considered in that work is that two of the three considered methods can detect overlapping communities,
so that a differential analysis has been carried out by the authors. They focused on the analysis of
several mesoscopic features such as the community size and density distribution and the neighborhood
overlap. In addition, they verified that results obtained by the analysis of synthetic networks are
profoundly different from those obtained by analyzing real-world datasets, in particular regarding the
community structure, highlighting the need to study online social networks by
acquiring data from the real platforms. Their findings are also confirmed in this study, in which we
acquired a sample of the social graph directly from the Facebook platform.

An interesting work which is closely related to this study regards the assessment of the strength of weak
ties theory in the context of Twitter [5]. In that work, it emerges that one of the roles of weak ties is to
connect small communities of acquaintances which are not close enough to belong to the same community
but, on the other hand, are somehow proficiently in contact. Clues in this direction also come from this
study, even though the two networks are topologically dissimilar
(i.e., Twitter is represented by a directed graph with multiple types of edges) and carry a different
interpretation of social ties themselves. In fact, social ties in Twitter represent hierarchical connections
(in the form of follower and followed users), while Facebook tries to reflect a friendship social structure
which better represents the community structure of real-world social networks.

Finally, the perspective of the study of the community structure has recently been reinvented [54]
by considering the problem of detecting communities of edges instead of the classical communities of
nodes. This approach shows a nice feature, i.e., that link communities intrinsically incorporate the
concept of overlap. The authors' findings are applied to large social networks of mobile phone calls,
confirming the emergence of power law distributions also for link community structures. Similar studies
could be extended to online social networks like Facebook, in order to investigate the existence of
particular communication patterns or motifs.

Strength and limitations of this study 

In the following we discuss the main strengths and limitations of this study. To the best of our
knowledge, this is the first work that investigates the general mesoscopic structure of a large online social
network. This is particularly interesting since it is opposed to just trying to identify dense clusters in
large communities, which is the aim of several works discussed above.

This work highlights the possibility of inferring characteristics describing social dynamics and organization
patterns ongoing in large-scale social networks, by analyzing some mesoscopic features that arise from a
statistical and topological investigation. This kind of analysis has been recently carried out for some
types of social media platforms (such as Twitter [5]) which capture different nuances of relations (for
example, hierarchical follower-followed user relations), but there was a lack in the literature regarding online
social network platforms reflecting friendship relations, such as Facebook. This work tries to fill this gap,
provides results that relate well to those presented in recent literature, and describes novel insights on
the problem of characterizing social network structure on a large scale.

We can already envision two limitations of this work, which leave space for further investigation. First,
our sample relies purely on binary friendship relations, which represent the simplest way to capture the
concept of friendship on Facebook. On the other hand, there could be more refined representations of the
Facebook social graph, such as taking into consideration the frequency of interaction among individuals
of the network to weight the importance of each tie. The feasibility of such a study is
highly complicated by the privacy issues deriving from accessing more private information about users'
habits (such as the frequency of interaction with their friends), which limit our range of study.

The second shortcoming of this study derives from this aspect. In detail, the fact that we were
concerned with the analysis of publicly accessible profiles, and that we investigated the impact of restrictive
privacy settings in previous works [40], implies that our sample only reproduces a picture of the Facebook
social network which is partial and could slightly vary with respect to the overall social graph.
Another aspect which deserves more investigation is understanding how the incompleteness of
the sampling affects the characteristics of the community structure. In fact, even though we assessed the
statistical significance of our results, the impossibility of comparing our sample against the actual overall
graph limits the investigation of the bias introduced by the sampling process.

Conclusions 

The aim of this work was to investigate the emergence of social dynamics, organization patterns and
mesoscopic features in the community structure of a large-scale online social network such as Facebook.
This task was quite thrilling and not trivial, since a number of theoretical and computational challenges
arose.

First of all, we collected real-world data directly from the online network. In fact, as recently
highlighted in the literature [35], the differences between synthetic data and real, large-scale, online social
networks have profound implications on results.

After we reconstructed a statistically significant sample of the structure of the social graph of Facebook,
we unveiled its community structure. The main findings that emerged from the mesoscopic analysis of
the community structure of this network can be summarized as follows:

(i) We assessed the tendency of online social network users to constitute communities of small size, proving
the presence of a decreasing number of communities of larger size. This behavior follows a power law
distribution $P(k) \propto k^{-\gamma}$ with $0.72 < \gamma < 1.07$, depending on the sampling methodology,
and highlights the tendency of users to self-organize even in the context of large-scale online social networks.

(ii) We investigated the occurrence of connections among communities, finding that this mesoscopic
feature is also well described by a power law distribution, $P(k) \propto k^{-\gamma}$ with $\gamma = 2.45$.
This finding testifies to the importance of some kinds of links, commonly referred to as weak ties, that
proficiently connect communities to each other, in agreement with Granovetter's strength of weak ties
theory [18] and with recent studies on other online social networks [5].

(iii) Regardless of the adopted sampling methodology (that could introduce bias [30]) and the clustering
algorithm (that could introduce bias [32] or suffer from the well-known problem of the resolution limit [31]),
the community structure clearly emerges, supporting the significance of the results of this study.

(iv) The community structure is highly clustered, indicating the presence of the small world
phenomenon, which characterizes real-world social networks, according to the classical sociological studies
by Milgram [17].

(v) The diameter of the community structure network is between 4 and 5, which is in
agreement with the well-known six degrees of separation theory and closely reflects some heuristic evaluations
recently provided by Facebook [33, 34].



The achieved results open space for further studies in different directions. As far as our
long-term future research directions are concerned, we plan to investigate, amongst others, the following issues:

(i) Devising a model to identify the most representative users inside each given community. This would
leave space for further interesting applications, such as the maximization of advertising on online social
networks, the analysis of communication dynamics, the spread of influence and information, and so on.

(ii) Exploiting geographical data regarding the physical location of users of Facebook, to study the
effect of strong and weak ties in society [18]. In fact, it is known that a relevant additional source
of information is represented by the geographical distribution of individuals [55-57]. For example, we
suppose that strong ties could reflect relations characterized by physical closeness, while weak ties could
be more appropriate to represent connections among physically distant individuals.

(iii) Merging information from different networks (e.g., social and geographical) and exploiting
it to get additional insights about the structure of the network and the role of nodes and edges in
social dynamics and in organization patterns.

(iv) Finally, we devised a strategy to estimate the strength of ties between two social network
users [58] and we want to study its application to online social networks on a large scale. In the case
of social ties, this is equivalent to estimating the friendship degree between a pair of users by considering
their interactions and their attitude to exchange information.

Appendix

In this appendix we briefly discuss the background in community detection algorithms and explain the
functioning of the two community detection methods adopted during our experimentation, namely LPA
and FNCA.



Community detection in complex networks 



The problem of discovering the community structure of a network has been approached in several different
ways. A common formulation of this problem is to find a partitioning $V = (V_1 \cup V_2 \cup \dots \cup V_n)$
into disjoint subsets of the vertices of the graph $G = (V, E)$ representing the network (in which the
vertices represent the users of the network and the edges represent their social ties) in a meaningful manner.

The most popular quantitative measure to prove the existence of an emergent community structure in a
network, called network modularity, has been proposed by Girvan and Newman [59, 60]. It is defined as
the sum of the differences between the fraction of edges falling in each given community and the expected
fraction if they were randomly distributed. Consider a network which has been partitioned into m
communities; its value of network modularity is



$$Q = \sum_{s=1}^{m} \left[ \frac{l_s}{|E|} - \left( \frac{d_s}{2|E|} \right)^2 \right] \qquad (5)$$



assuming $l_s$ is the number of edges between vertices belonging to the s-th community and $d_s$ is the sum of
the degrees of the vertices in the s-th community. High values of Q imply high values of $l_s$ for each
discovered community. In that case, detected communities are dense within their structure and weakly
coupled with each other.
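
As a worked illustration of the definition of Q, the following minimal Python sketch computes the
modularity of a given partition of an unweighted, undirected graph. The edge-list and membership-dictionary
representations, the function name modularity and the toy graph are assumptions made for the example.

```python
def modularity(edges, membership):
    """Network modularity Q of a partition of an undirected, unweighted graph."""
    m = len(edges)
    internal = {}  # l_s: edges with both end-vertices in community s
    degree = {}    # d_s: sum of the degrees of the vertices in community s
    for u, v in edges:
        cu, cv = membership[u], membership[v]
        degree[cu] = degree.get(cu, 0) + 1
        degree[cv] = degree.get(cv, 0) + 1
        if cu == cv:
            internal[cu] = internal.get(cu, 0) + 1
    return sum(internal.get(s, 0) / m - (d / (2 * m)) ** 2
               for s, d in degree.items())


# Hypothetical toy graph: two triangles joined by the single edge (2, 3).
two_triangles = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
partition = {0: "A", 1: "A", 2: "A", 3: "B", 4: "B", 5: "B"}
q = modularity(two_triangles, partition)
```

For this graph each community has $l_s = 3$ and $d_s = 7$ with $|E| = 7$, so that
$Q = 2 \, (3/7 - (7/14)^2) \approx 0.357$, while placing all vertices in a single community gives Q = 0.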

Partitioning a network into disjoint subsets may raise some difficulties. In fact, each user in the network
possibly belongs to several different communities; the problem of overlapping community detection has
recently received a lot of attention (see [28]). Moreover, there may exist networks in which a certain
individual does not belong to any group, remaining isolated, as recently highlighted by Hunter et al. [61].
Such a case commonly happens in real and online social networks, as reported by recent social studies [62].



Community detection techniques 

In its general formulation, the problem of finding communities in a network is solvable by assigning each
vertex of the network to a cluster in a meaningful way. There exist different paradigms to solve this
problem, such as spectral clustering [63, 64], which relies on optimizing the process of cutting the
graph, and the network modularity maximization methods.

Spectral clustering techniques have an important limitation. They require prior
knowledge of the network, to define the number of communities present in the network and their size. This
makes them unsuitable if the aim is to unveil the unknown community structure of a given network.

As for network modularity maximization techniques, the task of maximizing the objective function Q has
been proved NP-hard [65], thus several heuristic techniques have been presented during the last years.
The Girvan-Newman algorithm [59, 60, 66] is an example. It exploits the assumption that it is possible
to maximize the value of Q by deleting edges with a high value of betweenness, starting from the intuition
that they connect vertices belonging to different communities. Unfortunately, the cost of this algorithm
is $O(n^3)$, where n is the number of vertices in the network, which makes it unsuitable for large-scale
networks. A large number of improved versions of this approach have been provided in the last years and are
extensively discussed in [25, 26].



From a computational perspective, some of the state-of-the-art algorithms are the Louvain method [67, 68],
LPA [43, 69], FNCA [44] and a voltage-based divisive method [70]. All these algorithms have
near linear computational cost.

Recently, the problem of discovering the community structure of a network, including the possibility of
finding overlapping nodes belonging to different communities at the same time, has attracted a lot of
attention from scientists because of the seminal paper presented by Palla et al. [71]. A lot of effort
has been spent in order to advance novel possible strategies. For example, an interesting approach has
been proposed by Gregory [72], based on an extension of the Label Propagation Algorithm adopted
in this work. On the other hand, an approach in which hierarchical clustering is instrumental to find
the overlapping community structure has been proposed by Lancichinetti et al. [73, 74].



Label Propagation Algorithm (LPA) 

The LPA (Label Propagation Algorithm) [43] is a near linear time algorithm for community detection. Its
functioning is very simple, especially considering its computational efficiency. LPA uses only the network
structure as its guide, is optimized for large-scale networks, does not follow any pre-defined objective function
and does not require any prior information about the communities. Labels represent unique identifiers,
assigned to each vertex of the network.

Its functioning, as described in [43], is the following:

Step 1 To initialize, each vertex is given a unique label; 

Step 2 Repeatedly, each vertex updates its label with the one used by the greatest number of neighbors. 
If more than one label is used by the same maximum number of neighbors, one is chosen randomly. 
After several iterations, the same label tends to become associated with all the members of a 
community; 

Step 3 Vertices labeled alike are added to one community. 

The authors themselves showed that, under specific conditions, this process may fail to converge. In
order to avoid deadlocks and to guarantee an efficient network clustering, we follow their suggestion
and adopt an asynchronous update of the labels, in which each vertex considers the labels of some
neighbors from the previous iteration and of others from the current one. This precaution ensures the
convergence of the process, usually in a few steps.



Raghavan et al. [43] report that five iterations are sufficient to correctly classify 95% of the vertices
of the network. After some experimentation, we found this estimate too optimistic, so we raised the
maximum number of iterations to 50, which is a good compromise between the quality of the results and
the time required for computation.
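The three steps, the asynchronous update and the iteration cap can be sketched as follows. This is a minimal illustration, assuming the graph fits in memory as a dictionary of neighbor sets with sortable vertex identifiers. The function names, the `seed` parameter and the rule that a vertex keeps its label when that label is already among the most frequent ones are assumptions made for the example; they are not taken from [43] or from the implementation used in this work.

```python
import random
from collections import Counter

def label_propagation(adj, max_iter=50, seed=0):
    """Asynchronous label propagation on an undirected graph.

    adj maps each vertex to the set of its neighbors.
    Returns a dict vertex -> label.
    """
    rng = random.Random(seed)
    labels = {v: v for v in adj}          # Step 1: one unique label per vertex
    nodes = list(adj)
    for _ in range(max_iter):             # iteration cap (50 in the text)
        rng.shuffle(nodes)                # random visiting order
        changed = False
        for v in nodes:
            if not adj[v]:
                continue                  # an isolated vertex keeps its label
            # Asynchronous update: neighbors already visited in this sweep
            # contribute their new labels, the others their old ones.
            counts = Counter(labels[u] for u in adj[v])
            best = max(counts.values())
            candidates = sorted(l for l, c in counts.items() if c == best)
            # Assumption: keep the current label if it is already a maximum.
            if labels[v] not in candidates:
                labels[v] = rng.choice(candidates)   # Step 2: random tie-break
                changed = True
        if not changed:
            break                         # every vertex holds a majority label
    return labels

def communities(labels):
    """Step 3: vertices labeled alike form one community."""
    groups = {}
    for v, label in labels.items():
        groups.setdefault(label, set()).add(v)
    return list(groups.values())
```

On a graph made of two disconnected triangles, for instance, labels cannot cross between the triangles and each triangle settles on a single label within one sweep, so `communities` returns the two triangles.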

A characteristic of this approach is that it produces groups that are not necessarily contiguous: the
only paths connecting a pair of vertices in a group may pass through vertices belonging to different
groups. Although in our case this condition would be acceptable, we adopted the suggestion of the
authors and added a final step that splits each group into one or more contiguous communities.
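One possible form of this final splitting step is a breadth-first search that only crosses edges whose endpoints share a label. The sketch below is an illustration under that assumption; the function name and the dictionary-based graph representation are hypothetical and are not taken from [43].

```python
from collections import deque

def split_into_contiguous(adj, labels):
    """Split every label group into its connected components.

    adj maps each vertex to the set of its neighbors; labels maps each
    vertex to its group label. Each returned set is contiguous.
    """
    seen = set()
    result = []
    for start in adj:
        if start in seen:
            continue
        seen.add(start)
        component = {start}
        queue = deque([start])
        while queue:
            v = queue.popleft()
            for u in adj[v]:
                # Follow an edge only if both endpoints carry the same label.
                if u not in seen and labels[u] == labels[start]:
                    seen.add(u)
                    component.add(u)
                    queue.append(u)
        result.append(component)
    return result
```

For a path 0-1-2 in which vertices 0 and 2 share a label and vertex 1 carries a different one, the group {0, 2} is not contiguous and the function returns three singleton communities.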



The authors proved its near-linear computational cost [43].



Fast Network Community Algorithm (FNCA)

FNCA (Fast Network Community Algorithm) [44] is a modularity maximization algorithm for community
detection, optimized for large-scale social networks.

Given an unweighted and undirected network $G = (V, E)$, suppose the vertices are divided into
communities such that vertex i belongs to community r(i), denoted by $c_{r(i)}$; the function Q is
defined as in Equation 6, where $A = (A_{ij})_{n \times n}$ is the adjacency matrix of network G:
$A_{ij} = 1$ if node i and node j are connected to each other, $A_{ij} = 0$ otherwise. The function
$\delta(u, v)$ is equal to 1 if $u = v$ and 0 otherwise. The degree $k_i$ of any vertex i is defined as
$k_i = \sum_j A_{ij}$, and $m = \frac{1}{2} \sum_{ij} A_{ij}$ is the number of edges in the network.

We convert Equation 6 to Equation 7, which expresses the function Q as the sum of the functions f of all
nodes. The function f can be regarded as the difference between the number of edges that fall within
communities and the expected number of edges that fall within communities, from the local angle of any
node in the network. The function f of each node can measure whether a network division indicates a
strong community structure from its local point of view.



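$$Q = \frac{1}{2m} \sum_{i,j} \left( A_{ij} - \frac{k_i k_j}{2m} \right) \delta(r(i), r(j)) \qquad (6)$$

$$Q = \frac{1}{2m} \sum_{i} f_i, \qquad f_i = \sum_{j \in c_{r(i)}} \left( A_{ij} - \frac{k_i k_j}{2m} \right) \qquad (7)$$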
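The decomposition of Q into per-node terms can be checked numerically on a small graph. The sketch below assumes an undirected, unweighted graph without self-loops and with at least one edge, stored as a dictionary of neighbor sets; the names `local_f` and `modularity` are illustrative and do not come from [44].

```python
from collections import defaultdict

def local_f(adj, community):
    """Per-node term f_i: sum over j in i's community of (A_ij - k_i * k_j / 2m)."""
    two_m = sum(len(nbrs) for nbrs in adj.values())   # sum of degrees equals 2m
    degree_sum = defaultdict(int)                     # total degree per community
    for v, nbrs in adj.items():
        degree_sum[community[v]] += len(nbrs)
    f = {}
    for v, nbrs in adj.items():
        inside = sum(1 for u in nbrs if community[u] == community[v])
        f[v] = inside - len(nbrs) * degree_sum[community[v]] / two_m
    return f

def modularity(adj, community):
    """Q as the sum of the per-node terms f_i, divided by 2m."""
    two_m = sum(len(nbrs) for nbrs in adj.values())
    return sum(local_f(adj, community).values()) / two_m
```

For two triangles {0, 1, 2} and {3, 4, 5} joined by the single edge 2-3, we have m = 7. The four nodes of degree 2 each have f = 2 - 2 * 7 / 14 = 1 and the two bridge nodes each have f = 2 - 3 * 7 / 14 = 0.5, so Q = 5 / 14, roughly 0.357. Placing all six nodes in one community gives f = 0 for every node and therefore Q = 0.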

The authors [44] proved that: (i) any node in a network can evaluate its function f using only local
information (the information of its community); (ii) if changing the label of a node increases its
function f while the labels of the other nodes do not change, the function Q of the whole network
increases too. The community detection algorithm is based on these two properties: each node
maximizes its own function f using only the local information available to it, which in turn
optimizes the function Q.

Moreover, in complex networks with a community structure, the intuition holds that any node should either
share its label with one of its neighbors or be a cluster by itself. Therefore, each node does not need
to compute its function f for all the labels at each iteration, but only for the labels of its neighbors.
This improvement not only decreases the time complexity of the algorithm, but also allows it to optimize
the function Q using only local information about the network community structure.

It has been proved that this algorithm, under certain conditions, may converge slowly; thus we
introduced a limit T on the number of iterations as an additional termination condition. Experimental
results show that the clustering solution of FNCA is good enough within 50 iterations for most
large-scale networks. Therefore, the iteration limit T is set to 50 in all the experiments in this paper.
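The local optimization described above can be illustrated with the following simplified sketch. It is not the implementation of [44]: the singleton initialization, the random visiting order, the `seed` parameter, the sequential updates and the small tolerance used to accept a move are all assumptions made for the example. Each node evaluates f only for its current label and for the labels of its neighbors, and the loop stops after `max_iter` sweeps, which plays the role of T.

```python
import random
from collections import Counter, defaultdict

def modularity(adj, label):
    """Q for an undirected, unweighted graph (dict of neighbor sets, at least one edge)."""
    two_m = sum(len(nbrs) for nbrs in adj.values())
    inside = defaultdict(int)       # ordered pairs (i, j) linked inside each community
    degree_sum = defaultdict(int)   # total degree of each community
    for v, nbrs in adj.items():
        degree_sum[label[v]] += len(nbrs)
        inside[label[v]] += sum(1 for u in nbrs if label[u] == label[v])
    return sum(inside[c] / two_m - (degree_sum[c] / two_m) ** 2 for c in degree_sum)

def fnca_sketch(adj, max_iter=50, seed=0):
    """Each node greedily maximizes its local f over the labels of its neighbors."""
    rng = random.Random(seed)
    two_m = sum(len(nbrs) for nbrs in adj.values())
    label = {v: v for v in adj}                  # assumption: singleton start
    if two_m == 0:
        return label                             # no edges, nothing to optimize
    degree_sum = defaultdict(int)
    for v, nbrs in adj.items():
        degree_sum[label[v]] += len(nbrs)
    nodes = list(adj)
    for _ in range(max_iter):                    # iteration limit T
        rng.shuffle(nodes)
        moved = False
        for v in nodes:
            k = len(adj[v])
            links = Counter(label[u] for u in adj[v])   # edges from v to each label
            old = label[v]
            degree_sum[old] -= k                 # evaluate f with v taken out

            def f(c):
                # f of v if it carried label c (v's own degree counted in c)
                return links[c] - k * (degree_sum[c] + k) / two_m

            best, best_f = old, f(old)
            for c in sorted(links):              # only neighbors' labels are tried
                if f(c) > best_f + 1e-12:
                    best, best_f = c, f(c)
            label[v] = best
            degree_sum[best] += k
            moved = moved or best != old
        if not moved:
            break                                # no node can improve its f
    return label
```

Because nodes are updated one at a time, every accepted move raises f for the moving node and hence Q for the whole network, in line with property (ii). The result is a local optimum that depends on the visiting order, so different seeds may yield different partitions.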



The authors proved the near-linear cost of this algorithm [44].
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Figure 1. FNCA vs. LPA (uniform sample) community size probability distribution. 



Table 1. BFS and uniform samples description. 



Feature                             BFS             uniform

No. visited users                   63.4K           48.1K

No. discovered neighbors            8.21M           7.69M

No. total edges                     12.58M          7.84M

Size largest connected component    98.98%          94.96%

Avg. degree (visited users)         396.8           326.0

2nd largest eigenvalue              68.93           23.63

Effective diameter                  8.69            14.72

Avg. clustering coefficient         1.88 · 10^-2    1.40 · 10^-2

Density                             0.626%          0.678%

In this table we report some statistics regarding the two samples, BFS and uniform, which were
collected in August 2010 from the Facebook social network.
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Figure 2. FNCA vs. LPA (BFS sample) community size probability distribution. 
Table 2. Results on Facebook network samples. 



Algorithm    No. Communities    Network modularity    Time (s)

BFS (8.21M vertices, 12.58M edges)

FNCA         50,156             0.6867                5.97 · 10^4

LPA          48,750             0.6963                2.27 · 10^4

uniform (7.69M vertices, 7.84M edges)

FNCA         40,700             0.9650                3.77 · 10^4

LPA          48,022             0.9749                2.32 · 10^4



This table summarizes performance and results of the two chosen community detection algorithms (i.e., 
FNCA and LPA) applied to the samples we collected from Facebook. 

Table 3. Representation of a community structure. 



Community-ID        List of Members

community-ID_1      {user-ID_a; user-ID_b; ...; user-ID_c}

community-ID_2      {user-ID_i; user-ID_j; ...; user-ID_k}

...                 {...}

community-ID_N      {user-ID_x; user-ID_y; ...; user-ID_z}



To represent the community structure discovered in each sample we adopted the format reported in this 
table. 
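As an illustration of this format, the sketch below turns a vertex-to-label assignment into one line per community. The function names, the numbering of communities and the tab separator are assumptions made for the example; the file format actually used for the samples is not specified beyond the table above.

```python
def to_community_table(labels):
    """Map consecutive community IDs to sorted lists of member IDs.

    labels maps each user ID to the label produced by a clustering algorithm.
    """
    members = {}
    for user, label in labels.items():
        members.setdefault(label, []).append(user)
    groups = sorted(sorted(g) for g in members.values())
    return {"community-ID_%d" % i: g for i, g in enumerate(groups, start=1)}

def format_table(table):
    """Render one 'community-ID {member; member; ...}' line per community."""
    lines = []
    for cid, users in table.items():
        body = "; ".join(str(u) for u in users)
        lines.append(cid + "\t{" + body + "}")
    return "\n".join(lines)
```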
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Figure 3. Heat-map of similarity (uniform sample). 
Table 4. Similarity degree of community structures 



Degree of Similarity FNCA vs. LPA

Metric    Sample     Common    Mean      Median    Std. D.

J         BFS        2.45%     73.28%    74.24%    18.76%

J         uniform    35.57%    91.53%    98.63%    15.98%



In this table we report the results obtained computing the similarity between the community structure 
discovered by using FNCA and LPA in the BFS and uniform samples, computed by means of the 
binary Jaccard coefficient. 
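The binary Jaccard coefficient of two communities is the size of their intersection divided by the size of their union. The sketch below shows one plausible way to obtain summary statistics like those in the table: each community found by one algorithm is matched with its most similar community from the other. This matching rule, the reading of "Common" as the fraction of identical communities, and all function names are assumptions made for illustration, not a description of the procedure actually used.

```python
from statistics import mean, median, pstdev

def jaccard(a, b):
    """Binary Jaccard coefficient: size of the intersection over size of the union."""
    union = len(a | b)
    return len(a & b) / union if union else 1.0

def best_match_similarity(first, second):
    """For each community in `first`, the highest Jaccard score against `second`."""
    return [max(jaccard(a, b) for b in second) for a in first]

def summarize(scores):
    """Fraction of identical communities, mean, median and standard deviation."""
    return {
        "common": sum(1 for s in scores if s == 1.0) / len(scores),
        "mean": mean(scores),
        "median": median(scores),
        "std": pstdev(scores),
    }
```

The all-pairs comparison is quadratic in the number of communities, so for structures with tens of thousands of communities a faster candidate-selection step would be needed in practice.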

Table 5. Amount of outlier communities. 



                    Amount with respect to Number of Members

Set        Alg.     > 1K     > 5K     > 10K    > 50K    > 100K

BFS        FNCA     4        1        2        1        1

           LPA      1        -        2        -        1

uniform    FNCA     81       -        -        -        -

           LPA      -        -        -        -        -

We defined as outlier communities those communities whose size significantly exceeds what would be
expected from the power law distribution describing this feature. Outlier communities are reported in
this table for each sample (i.e., BFS and uniform) and community detection method (i.e., FNCA and LPA).
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Figure 4. Heat-map of similarity (BFS sample) 



Table 6. Features of the meta-networks representing the community structure for the 
uniform sample. 



Feature FNCA LPA 

No. nodes/edges 36,248/836,130 35,276/785,751 

Min./Max./Avg. weight 1/16,088/1.47 1/7,712/1.47 

Size largest conn. comp. 99.76% 99.75% 

Avg. degree 46.13 44.54 

2nd largest eigenvalue 171.54 23.63 

Effective diameter 4.85 4.45 

Avg. clustering coefficient 0.1236 0.1318 

Density 0.127% 0.126% 



In this table we report some statistics regarding the community structure meta-network obtained from 
the uniform sample, by using the two chosen community detection algorithms (i.e., FNCA and LPA). 
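A meta-network of this kind can be derived from an edge list and a community assignment as sketched below. The example assumes that each community becomes one node and that the weight of a link is the number of edges running between the two communities. Both this weighting and the function names are assumptions made for illustration.

```python
from collections import defaultdict

def build_meta_network(edges, labels):
    """Collapse each community into one node and count edges between communities.

    edges is an iterable of (u, v) pairs; labels maps each vertex to its
    community. Returns a dict {(community_a, community_b): weight}.
    """
    weights = defaultdict(int)
    for u, v in edges:
        cu, cv = labels[u], labels[v]
        if cu != cv:                               # intra-community edges are dropped
            key = (cu, cv) if cu <= cv else (cv, cu)
            weights[key] += 1
    return dict(weights)

def meta_stats(weights, n_nodes):
    """Basic statistics of an undirected weighted meta-network with n_nodes nodes."""
    n_edges = len(weights)
    return {
        "avg_degree": 2 * n_edges / n_nodes,
        "density": 2 * n_edges / (n_nodes * (n_nodes - 1)),
        "min_weight": min(weights.values()),
        "max_weight": max(weights.values()),
        "avg_weight": sum(weights.values()) / n_edges,
    }
```

With the FNCA figures reported in the table (36,248 nodes and 836,130 edges), these formulas give an average degree of about 46.13 and a density of about 0.127%, in agreement with the tabulated values.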
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Figure 5. Meta-network degree probability distribution (uniform sample). 
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Figure 6. Meta-network clustering coefficient distribution (uniform sample). 
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Figure 8. Meta-network shortest paths probability distribution (uniform sample). 
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Figure 10. Probability distribution map of links between communities of a given size 
(uniform sample, LPA clustering). 
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Figure 11. Inter-community edge fraction distribution (uniform sample, LPA clustering). 
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Figure 12. Meta-network representing the community structure (uniform sample, LPA 
clustering). 



